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Abstract 

In this paper a simple procedure to deal with label switching when exploring com¬ 
plex posterior distributions by MCMC algorithms is proposed. Although it cannot 
be generalized to any situation, it may be handy in many applications because of its 
simplicity and very low computational burden. A possible area where it proves to be 
useful is when deriving a sample for the posterior distribution arising from finite mix¬ 
ture models, when no simple or rational ordering between the components is available. 


1 Introduction 


Label switching is a well-known and fundamental problem in Bayesian estimation of finite 


mixture models (McLachlan and Peel, 2000). The label switching problem arises when ex¬ 


ploring complex posterior distributions by Markov Chain Monte Carlo (MCMC) algorithms 
because the likelihood of the model is invariant to the relabelling of mixture components. 

Since there are as many maxima as there are permutations of the indices (G!), the 
likelihood has then multiple global maxima. This is a minor problem (if a problem at 
all) when we perform classical inference, since any maximum leads to a valid solution and 
inferential conclusions are the same regardless of which one is chosen. 

On the contrary, invariance with respect to labels is a major problem when Bayesian 
inference is used. If the prior distribution is invariant with respect to the labelling as well 
as the likelihood, then the posterior distribution is multimodal. 

To make inference on a parameter specihc of a component of the mixture, a sample from 
the posterior that represent different modes would be inappropriate. An actual MCMC 
sample may or may not switch labels depending on the efficiency of the sampler. If the raw 
MCMC sampler randomly switches labels, then it is unsuitable for exploring the posterior 
distributions for component-related parameters. 
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A range of solutions has been proposed to perform inference in presence of label switching 


(Friihwirth-Schnatter, 2001 Stephens, 2000), but for complex models most of the existing 
procedures are complex and computationally expensive. 

A full solution entails obtaining valid samples for each parameter, and the methods in 
Section are designed to relabel the raw Markov chains for this purpose. Simpler solutions 
are available if we do not need posterior samples for all the parameters. 

In this paper a simple procedure based on the post MCMC relabelling of the chains to deal 
with label switching when exploring complex posterior distributions by MCMC algorithms 
is proposed. 

we can totally ignore the relabelling if the quantities of 


As pointed out in Section 2.1 


interest are label invariant. Besides the extreme case of label invariant quantities, we illus¬ 
trate in Section]^ and 1^ how to obtain a clustering and even a matrix of probabilities of units 
belonging to groups, using the raw MCMC sample without the need to fully relabel it. In 
Section we propose a method which performs a relabelling starting from a suitable cluster¬ 
ing of the samples, with the aim of using an MCMC sample to infer on the characteristics 
of the components in terms of both probabilities of each unit being in each group and the 
group parameters. The performance of the algorithm is explored via the simulation study 
discussed in Section and a case study on real data is presented in Section Section 
concludes. 


2 The relabelling problem 

Prototypical models in which the labelling issue arises are mixture models, where, for a 
sample y = (?/i,..., Hn) we assume 


{Yi\Zi =g) ^ 

where the Zi, i = 1,... ,n, are i.i.d. random variables and 

Zj G {1,..., G}, P{Zi = g) = Tig. 


The likelihood of the model is then 


n G 

HE (1) 

i=l g=l 

with y, = (/ii,..., yc) component-specihc parameters and tt = (tti, ..., tig) mixture weights. 
Equation Q is invariant under a permutation of the indices of the groups, that is, if 
(jn • • • Gg) is a permutation of (1,..., G) and tt' = ..., y,' = ..., /ij^) are 

the corresponding permutations of tt and y,, then 

L{y; fi, 77, 0) = L{y; tt ', 0). (2) 

As a consequence, the model is unidentihed with respect to an arbitrary permutation of the 
labels. 
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When Bayesian inference for the model is performed, if the prior distribution tt, 0) 
is invariant under a permutation of the indices, that is tt, 0) = Pq{pl\ tt', 0), then so is 
the posterior 

p {^ l , 77 , ( t )\ y ) oc Po(a^, tt, 0 )h( 7 /; tt, 0 ), ( 3 ) 

which is then multimodal with (at least) G\ modes. This implies that all simulated parame¬ 
ters should be switched to one among the G! symmetric areas of the posterior distribution, 
by applying suitable permutations of the labels to each MCMC draw. 


2.1 Relabelling and label switching in MCMC sampling 


In the following we assume that we obtained an MCMC sample from the posterior distribu¬ 
tion for model ([^ with a prior which is labelling invariant. We denote as : h = 1,..., if} 
the sample for the parameter 9 = (^,77,0). We assume that also the Z variable is MCMC 
sampled and denote as {[Z]^ : f = 1,..., if} the corresponding sample. 

In principle, a perfectly mixing chain should visit the points (^, 77, 0 ) and (^', 77', 0 ) with 
the same frequency. A chain with these characteristics for a model with G = 2 and where 
f{-;Pg,(j)) is the Gaussian distribution with parameters pg and 0 , J\r{pg,(j)), is depicted in 
Figure [^a), together with the posterior distribution for p. 

A chain with a less than perfect mixing may either concentrate on one mode of the 
posterior distribution (Figure [^b)) or exhibit random switches (Figure [^c)). 

A naive, but effective, solution to the relabelling issue is to use a sampler which is 
inefficient with respect to the labelling - that is, it is unlikely to switch labels - but otherwise 
efficient (Puolamaki and Kaski, 2009). This can be an ex post solution, that is, we can ignore 


the relabelling issue if we verify that we obtained a chain where no switch occurred, but it 
is impractical in general terms since it is difficult to tune a sampler so that it is inefficient 
enough to avoid label switches but not too inefficient. 

We note that the presence of label switches (or the whole issue of relabelling) is totally 
not relevant if the quantities we are interested in are invariant with respect to the labels, as 
is the case for a prediction of (?/i, 1 / 2 ) (depicted in Figure]^ top row), or the inference for the 
parameter 0. 

A particularly relevant example of invariant quantity is the probability of two units being 
in the same group, Cij = P{Zi = Zj\V), i, j = 1,..., n, whose estimate based on the sample 

is 


1 

cp - j! 


H 




'j\h\ 


( 4 ) 


h=l 


The nxn matrix G with elements Aj can be seen as an estimated similarity matrix between 
units, and the complement to one Sij = 1 — Cij as a dissimilarity matrix (note that it is not 
a distance metric as Sij = 0 does not imply that the units i and j are the same). 

Relabelling becomes relevant when we are interested, directly or indirectly, in the features 
of the G groups, for example the posterior (and predictive) distributions of component- 
related quantities such as the difference p 2 — hi or the probability of each unit belonging to 
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( 5 ) 


each group, qig = P{Zi = g\V), whose MCMC estimate is 

1 ^ 

^ig ~ ^ I h = g\ , 

h=l 

for i = 1,...,n and g = 1,... ,G. 

In Figure bottom row, we depict the posterior distribution of /i 2 — based on the 
samples {[fi 2 \h — : h = 1,..., H} obtained using the three chains. The hrst version is 

formally correct given that the model is not identihed, but it is not able to tell us what is the 
average difference between the groups. The second version does answer to our question on 
the difference between the groups but is based on a very partial exploration of the posterior. 
The third version leads to an incorrect answer. 

It is then clear that the raw MCMC sample can not be used to study the posterior 
distributions of component-related quantities such as /ig or P{Zi = glV). In order to study 
the posterior distributions of component-related quantities such as /ig, we need to dehne a 
suitable method to permute the labels at each iteration of the Markov chain. Then, the new 
labels are such that different labels do refer to different components of the mixture. 


3 Partitioning observations 


A partition of the observations, meaning a point estimate of the group for each unit, can be 
easily obtained. Doing this, however, the issue of obtaining an estimate for groups features 
(posteriors of fig) or the probability of units belonging to each group {qig in Equation ([^) 
remains open. In fact, the usual difficulties related to clustering techniques apply (for in¬ 
stance, the groups depend on the choice of the distance). A partition can be also obtained 
by maximizing the posterior distribution, notwithstanding the fact that the maximum is not 
unique (there are G\ modes), since the maxima are equivalent any would be suitable. 

Alternatively, the probabilities in Equation (|^ can be used to derive a partition of 
observations by employing some clustering technique based on a suitable similarity matrix. 

A more sophisticated option, see Fritsch and Ickstadt (2009), involves dehning a distance 
between partitions, for example 


d{z*, z) = ^di \z* 7 ^ \zi = Zk\ + d2 \zi = Zk\ \z* / 7^1 , 


( 6 ) 


i<k 


and then search for the partition which minimizes the expected distance with the true groups 
z, which means, if di = d 2 = 1, hnd z* which minimizes 


E{d{z*,z)\V) = '^\\z* = zl\- Cik\, 

i<k 


( 7 ) 


where Cik can be replaced by Cik- 

Alternative distances between partitions may be used, for instance the Rand i ndex 


4(z-.z) = i-<i(z-,n(G)) or the adjusted Rand index (^Hubert and Arabie 


Note that if the distance function is a linear operator then the following holds: 


1985| ). 


E{d{z*,z)\V)=d{z*,E{z\V)). 


( 8 ) 
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Figure 1: MCMC chains for (top row) and estimated posterior for where (a) a perfect 
mixing occurs (each of the two permutations is visited with equal frequency); (b) no switching 
is exhibited; (c) one random switch occurs. 


The expectations in ([^ or ([^ can be obtained using the MCMC sample as 

1 ^ 

BKV,z)|P) = -5^<i(V,Wn (9) 

h=l 

The optimization should be done in the space of all possible partitions, since this can be 
very large, the authors suggest performing optimization on a suitable subset, reasonable 
alternatives being the set {[2:]h} or the set of clusterings resulting from different classical 
algorithms applied to the similarity matrix (|^ (or the union of the two). 


4 Obtaining probabilities of belonging to a group 


Puolamaki and Kaski| ( |2009[ ) deal with the relabelling issue considering as an objective the 
n X G matrix with elements qig = P{Zi = g) (/d in their notation). This is obtained by 
maximizing the Bernoulli likelihood. The latter can be specihed according to two alternative 


formulations. The hrst is one in Puolamaki and Kaski (2009), where the HG x n matrix Z' 
is such that 

Zb = 1 iff [Zi]h = g where r = G{h -l)+ g, 
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Figure 2: Estimated posteriors for {yi,y 2 ) (top row) and /i 2 — p-i (bottom row) based on 
chain (a), (b), of Figure 
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and get 


R G n 


r=l g=l 2=1 

The above likelihood can also be written as 


( 10 ) 


H G n 


h=l g=l i=l 


( 11 ) 


The intnitive idea behind this strategy is that if two nnits ii and i 2 often belong to the same 
gronp, that is, for many r, then they shonld be assigned to the same gronp, thns 

leading to a high valne of qi^g and qi^g for some valne of g. Note that the likelihood above is 
itself labelling invariant, thns it has G\ maxima. 

An EM algorithm is proposed to perform the optimization: 

E step: for each row r (which represent a gronp in an iteration) and for each gronp obtain 


(lig) 


Irg 


l-2h 


eL nr=i 4 "(i“^* 9 ) 


ZL. 


\l-ZL, 


p((^lg; • ■ ■ ; Zng)\ 0 ) 

Y.g=lGp{{Zig,...,Zng)\9y 


M step: compnte the mean of the Zh with weights 

Ef=l IrgZy Er:ZC=l 7rg 


qig — 


E R 

r=l Irg l^r=l 'Irg 


R 


Eqnivalently, the matrix Q can be fonnd minimizing the cost fnnction 

H 


1 


IIEg!® 


h=i j/ev 


liu{[Zi]h)- 


5 Relabeling methods 

Relabelling means permnting the labels at each iteration of the Markov chain in snch a way 
that the relabelled chain can be nsed to draw inference on component specihc parameters. 
Loosely speaking we may say that the relabelled chain can be seen as a chain where no label 
switching has occnrred or, in other words, the new labels are snch that different labels do 
refer to different components of the mixtnre. 

One method to perform the relabelling involves imposing identihability constraints snch 
as TTi < 7r2 < ... < ttg or /ii < /r 2 < ... < Pg- Eqnivalently, this may be seen as a condi¬ 
tioning of the fnll (mnltimodal) posterior where the conditioning event is the identihability 
constraint. Snch a solntion, althongh theoretically sonnd, may not be applicable when an 
obvious constraint does not exist and it may not work well if the components are not well 


separated (Stephens, 2000 Jasra et al., 2005). 


It is worth noting that relabelling strategies may act during the MCMC sampling, and/or 
they may be used to post-process the chains. In general, those solutions which post-process 
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the chains are particularly convenient (since the issue can be ignored in performing the 
MCMC and then dealt with later). Generally, existing relabelling procedures select the per¬ 
mutation of the labels that minimizes a well dehned distance between some components, such 
as pivots and classihcation probabilities, at each MCMC iteration. Papastamoulis (2016) 


provides the lab el. switching R package with a range of deterministic and probabilistic 
methods for performing relabelling: in Section and a comparison between some of these 
alternatives and our methodology will be provided. 


5.1 Decision theoretic approach 

A rather general decision theoretic framework for the relabelling problem is proposed by 
. Such approach translates the problem to that of choosing an action a from 
A where a loss function £ : M x 0 —)■ M represents the loss we incur if we 
choose the action a and the true value of the parameter is 6. 

The loss function makes sense if it is permutation invariant (remember that if we permute 
the parameter the model remains the same), we can obtain a permutation invariant loss 
function C from a non invariant one Cq by dehning 

C{a]6) = minCo{a]v{d)). 

The action a is then chosen by minimizing the posterior expected loss 


Stephens (2000 


a set of actions 


7^(a) = £(£(a; d)\V), 


which can be approximated using the MCMC sample by 


H 


^(a) = 

h=l 
1 ^ 

= uYl min£o(a; M[^]h)) 

h ^^ Vh 


h=l 


H 


I ;^5Z^o(a;i^fc([6']/i)) 




( 12 ) 

(13) 

(14) 


h=l 


The action a can be the estimation of the parameter (or part of it) and the loss function 
may be a distribution to be htted or an estimation error, the choice should be driven by 
the objective of inference. If the objective is the clustering of n units into G groups a 
reasonable action is reporting the n x G matrix Q = [qig] where qig is the probability that 
the i-th unit belongs to the group g. A corresponding loss is then the distance, somehow 
measured ( [Stephens ( 2000[ ) employs the Kullback-Leibler distance), between Q and its true 
value P{0) = \pig{0)] where (for the toy example) 


Pig{0) = P{Zi = g\y,e) = 


'^gf iyii Pg: ^) 


The general algorithm for performing Stephens (2000) method is as follows 


















Start: from arbitrary permutations z/i, ..., uh- 
Step 1: obtain a = argmin^^^ >Co(a; h'h{[9]h)). 

a 


Step 2: obtain Vh = argminLo(a; 

fh 


Note that step 2 entails n minimizations with respect to all the permutations ((j!), [Stephens 
(2000) points out the existence of efficient numerical algorithm if the loss function Cq can 

G 

9=1 ^0 


be written as Cq = XlLi 4>)- 


A problem with this method might be the choice of the appropriate loss function/the 
dependence of the results on the loss function. Our method presented in Sect. does not 
require a minimization step, and for this reason might be computationally appealing in many 
situations. 


6 Pivotal method 

Suppose that a partition of the observations in G groups, Qi,... has been obtained as 
discussed in Section As already pointed out, this may be enough for some purposes, 
but we may be interested in the probabilities P{Zi = g) and in the posteriors for groups 
parameters, pg. 

Suppose that we can hnd G units, A,... ,ig, one for each group, which are (pairwise) 
separated with (posterior) probability one (that is, the posterior probability of any two of 
them being in the same group is zero). In terms of the matrix C, the G x G sub-matrix with 
only the row and columns corresponding to ii,... Go identity matrix. 

We then use the G units, called pivots in what follows, to identify the groups and to 


relabel the chains: for each h = 1,.. 

., H and g = 1,... ,G 



[f^glh = [f^[Zig]h\h^ 

(15) 

[Z^]h 

= g toT i : [Zi\h = 

(16) 


The availability of G perfectly separated units is crucial to the procedure, and it can 
not always be guaranteed. We now discuss three different circumstances under which the 
relabelling procedure is unsuitable 


(i) the number of actual groups in the MCMC sample is higher than G; 

(ii) the number of actual groups in the MCMC sample is lower than G; 

(iii) the number of actual groups in the MCMC sample is equal to G but the pivots are not 
perfectly separated. 

Let us hrst clarify what is meant by the number of actual groups. The model has G 
components, but some mixture components may be empty in the Markov chain, that is, it 
may happen that : [Zi]^ = g for some i} < G\/h. By actual number of groups we mean 
the number of non empty groups, Gq in what follows. It is then clear that the Markov chain 
does not have informations on more than Gq groups. 
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We also note that the number of non empty groups may vary with iterations, let 

[G]h = #{5- : [Zi]h = g for some i}. 


Consider now the set "Hi C ,H} of iterations where [G\h > G] some units and 

groups will then have no available pivot. These units will not be attributed any group by 


performing (16). Thus for these units 


G G G H 

^P{Zi = g) = ^qig = ^^^\[Zi\h = g\ < 1. 

g=l i=l 


9=1 


9=1 


We suggest cancelling those iterations of the chains where this occur, that is, the hnal - 
partial- chain is a sample from the posterior conditional on having at most G non empty 
groups. 

Consider now the set 7^2 C {1,..., TT} of iterations where [G]h < G; if h G 7^2, [Zi^h = 
[Zi^]h for some pivots 4, is- As a consequence, Chk > 0: the pivots are not perfectly separated. 


The procedure in (15) and (16) can not be performed (it is not well dehned), so also in this 
case we will have to cancel the corresponding part of the chain. Finally, consider the set 


Hs = {h : 3A;, s s.t. [Zi^]h = 

that is, the set of iterations where (at least) two pivots are put in the same group. Note that 
H2 C ^3 but 7^3 may be larger. The same provision as above applies, we need to get rid of 
this part of the chain. In the end, we will relabel the chain with iterations 


= (17) 

which can be considered a sample from the posterior distribution conditional on (i) there 
being exactly G non empty groups, (ii) the pivots falling into different groups. 


6.1 Pivots identification 


A relevant issue is how to identify the pivots, noting that perfectly separated units may not 
exist and that, even if they exist, we may not be able to hnd them since the set of all possible 
choices is too big to be fully searched. 

The general method we put forward is to select a unit for each group according to some 
criterion, for instance for group g containing units Qg we may chose i ^ Gg that maximizes 
one of the quantities 

nraxq^-, ^ Qp (18) 

* j&Gg i&Sg j^Gg 

or minimizes one of the quantities 


min Cj,, 




Cljj 


Gj- 

j^Gg 


(19) 


We introduce a further method, which we call Maxima Units Search (hereafter MUS), 
that turns out to be suitable in case of a low number of mixture component, e.g., G = 3,4. 
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This procedure differs from the others in the strategy for detecting pivots, since it does not 
rely upon a maximization/minimization step but it identifies those units satisfying a proper 
search within the estimated similarity matrix C (see the Appendix for more details on the 
MUS procedure). 

The quality of the choice of pivotal units by the proposed methods is measured by the 
probability of the conditioning event 

( 20 ) 


estimated by the (original, raw) MCMC sample. It is worth noting that the idea of solving 
the relabelling issue by fixing the group for some units dates back to Chung et ah (2004), 
who, however, gave no indication on how to choose the units. Also, since they suggest 
imposing such a restriction in the MCMC, there is no measure of the extent to which it 
influences the result (of the extent to which it is informative if we interpret it as a prior 
information). We note, however, that Chung et al. (2004) may be very interesting when a 
set of units which are to be attributed to different groups can be defined exogenously. 

Another related idea is pnt forward by Yao and Li (2014), who propose finding a reference 
labelling, that is, a clustering for the sample (for example, the posterior mode), and then 
relabel each iteration by minimizing some distance from the reference labelling. The general 
idea is similar to the one we snggest, but it is more computationally demanding because of 
the required minimizations, on the other hand it avoids the need to condition on the pivots 
being separated. We can argne, however, that the latter is not a big drawback of onr proposal 
since its effects can be measured and is likely to be small in many practical instances. 


7 Simulation study 

The aim of this section is to evaluate the performance of the pivotal method introduced 
before. In particnlar, our goal is to investigate the behavionr of the proposed solution for 
dealing with label switching in different simulated scenarios. For this purpose, we focus 
on data simulated from a mixture of non-equally weighted mixtures of bivariate Gaussian 
distribntions with nneqnal covariance matrices, so that the generated components may resnlt 
in overlapping clusters. Specifically, the simulation scheme consists in the following steps. 


Simulate n values Yi,..., Y„, from a mixture of mixtures of bivariate Gaussian distri¬ 
bntions, where 

2 

{Yi\Zi = fif) ~ ^Pg,W2(h<?^,F,). (21) 

S = 1 


That is, conditionally on being in gronp g G {1,..., G}, is picked ont from one of 
two possible Gaussian distribntions with weights, means and covariances Pgs, Pgs and 
Eg, s = 1,2, respectively. The likelihood of the model is then 


n G 

i=l g=l 


'^PgsM2G gsi 


,s=l 
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Scenario A 

Scenario B 

Scenario C 

k'ls 

(25,0) 

(-10,-10) 

(-10,-10) 

/^2s 

(60,0) 

(20,-10) 

(20,-10) 

hSs 

(0,20) 

(-10,20) 

(6,5) 

t^As 

(50,20) 

(20,20) 

(5,25) 


Table 1: Two-dimensional mean vectors used as input for the three illustrated scenarios A, 
B and C, with number of mixture components G = 4. 


(ii) Obtain an MCMC sample which effectively explores all modes of the posterior distri¬ 
bution. 

(iii) Estimate the n x n similarity matrix C with elements Cij = P{Zi = Zj\V), i,j = 
1,..., n, by Equation Q. 

(iv) Apply a suitable clustering technique based on the estimated dissimilarity matrix with 
elements Sij = 1 — Aj and obtain a partition of the observations in G groups with units 

9 = ^^ ■ ■ ■ ^G. 

(v) Detect the pivots, one for each group, according to one criterion among the ones 
discussed before. 

(vi) If necessary, discard those iterations of the chains belonging to Tfi U ^2 U Tfa (see 
Section]^ and relabel the resulting chain with iterations in Ho (see Equation 0) via 
@ and (@. 


In the following, a sample size of n = 1000 and G = 4 components are considered. For 
= 1,..., 4, we set TTg = 1/4, pgi = 0.2, pg 2 = 0.8, and Si = I 2 , S 2 = 200 12 , being I 2 the 
2x2 identity matrix. We generate our simulated data from model (21) (see Figure]^ using 
the input means reported in Table and obtain an MCMC sample by considering H = 3000 
iterations. 

We proceed following points (i)-(vi) described above. As a remark, two different clustering 
strategies are applied on the dissimilarities Sij in order to obtain G clusters of observations, 
namely the agglomerative and partitioning hierarchical clustering. Both methods only re¬ 
quire a distance or a dissimilarity matrix as input and return a set of nested clusters that 
are organized as a tree. The former starts with the points as individual clusters and, at 
each step, merges the closest pair of clusters, according to some criterion to compute cluster 
proximity; the latter starts with one, all-inclusive cluster and, at each step, splits a cluster 
until only singleton clusters of individual points remain. 

We observe that the two algorithms provide very similar results in terms of the resulting 
clusters, and do not affect the performance of the relabelling procedure. Therefore, for the 
sake of illustration, we restrict to agglomerative hierarchical clustering, where the so-called 
complete linkage is adopted as a common criterion for the computation of the dissimilarity 
between two clusters, since it is less susceptible to noise and outliers than other linkages. 

Figures |4]-[^ display the results of the agglomerative hierarchical clustering on simulated 
data from scenarios A, B and C, respectively. In each chart of Figures |4]-[^ a different method 
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(a) Scenario A (b) Scenario B 

Figure 3: Illustration of a simulated sample of size n 
4 components, according to three different scenarios, 
reported in Table 0 


(c) Scenario C 


= 1000 from model ([^ with G = 
The input means coordinates are 


for identifying the pivots is adopted ((a)-(g)), and the selected units are marked with red 
points on the plots. Recall that, by dehnition, the pivots are perfectly (pairwise) separated 
units. Therefore, the performance of the seven different identihcation methods will be higher 
as the posterior probability of any two of them being in the same group is closer to zero. 
As can be noticed, panels (b), (e), (f) and (g) seem to provide an accurate choice of the 
pivots in all situations, since they are clearly well separated and suitable as representative 
units for each group. This is not a negligible issue in terms of the relabelling performance, 
which is strongly affected by the choice of such G, by virtue of Equation (15). The estimated 
proportions of relabelled iterations based on 100 simulated samples are reported in Table 
As expected, better performances in terms of pivot selection are likely to reflect into a higher 
proportion of relabelled iterations in almost all situations. 

Coherently with the considerations drawn from Figures |4]-[^ methods (b), (e) and (f) 
register the highest chain proportions (less than 1% of the iterations is discarded) for both 
scenarios A and B. Method (c) seems to have the worst performance regardless of the con¬ 
sidered scenario, in particular, for scenario C the chain keeps only about 8% of the original 
iterations. The fact that the third simulated scenario shows globally less satisfactory results 
is not surprising. In fact, the input means are so close each other that the cluster algorithms 
may fail in recognizing the true data partition, thus reflecting on the quality of the choice 
of the pivotal units. However, (e) and (f) criteria and MUS algorithm are preferable to the 
others. 

In order to compare the proposed methodology with other relabelling algorithms, in the 
task of estimating the means of the mixture components, we consider the Puolamaki and 


Kaski procedure ( 

Puolamaki and Kaski 

, 2 

009 

) and three other methods implemented in the 

lab el. switching package ( 

Papastamoulis 

, 2016 

). In Figure^ the median estimates of rela- 


belled group means are plotted for a simulated example from scenario B and four alternative 
methods. As can be seen, our relabelling procedure seems to provide very accurate estimates 
of group means. Similar results are achieved by ECR-iterative-1, ECR and DATA-BASED, 
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while Puolamaki and Kaski algorithm appears not to yield reliable estimates for the gronp 
means. 

In Table are reported the mean sqnare errors of the relabelled estimates, obtained as 
mean over B = 100 macro-replications of the Enclidean distances between the inpnt means 
and the corresponding estimates, according to scenarios A, B and C. In all scenarios the 
highest mean sqnare errors are obtained for method (c), for each component of the mixtnre. 
Criteria (b), (e) and (f) give very similar results, and the MUS algorithm outperforms all 
other pivotal methods for three cases in Scenario A and two in Scenario B. ECR-iterative-1 
performances in terms of mean square errors are comparable with our algorithm in most 
cases, although it shows to give lower errors in the third scenario, while Puolamaki and 
Kaski algorithm seems to provide poor estimates in all situations (see also Figure]^. 



(a) 

(b) 

(c) 

(d) 

(e) 

(f) 

(g) 

Scenario A 

0.475 

0.993 

0.124 

0.506 

0.993 

0.993 

0.313 

Scenario B 

0.519 

0.998 

0.101 

0.707 

0.998 

0.998 

0.995 

Scenario C 

0.139 

0.300 

0.079 

0.267 

0.368 

0.507 

0.374 


Table 2: Estimated proportion of relabelled iterations (see Equation (20)), over 100 macro¬ 
replications, based on the original MCMC sample, according to Scenario A, B and C. The 
observations are clustered according to agglomerative hierarchical algorithm. 
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Figure 4: Simulated sample of size n = 1000 from Scenario A (see Table clustered accord¬ 
ing to agglomerative hierarchical algorithm. The pivotal units are identihed by adopting 
methods (a)-(g). 
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(g) MUS algorithm 

Figure 5: Simulated sample of size n = 1000 from Scenario B (see Table clustered accord¬ 
ing to agglomerative hierarchical algorithm. The pivotal units are identihed by adopting 
methods (a)~(g). 
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Figure 6: Simulated sample of size n = 1000 from Scenario C (see Table clustered accord¬ 
ing to agglomerative hierarchical algorithm. The pivotal units are identihed by adopting 
methods (a)~(g). 
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Figure 7: Scenario B. Crosses are input means, red points are the median values of rela¬ 
belled estimates. (Top left) Raw MCMC sample for pLg^g = (Top right) Reordered 

MCMC sample of Pivotal Method resulting from agglomerative hierarchical clustering and 
MUS algorithm. Reordered MCMC sample according to methods ECR-iterative-1, ECR, 
Puolamaki and Kaski and DATA-BASED. 
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Scenario A 


R2 

R3 

R4 

(a) 

13.8078 

1.6724 

2.0158 

10.8232 

(b) 

13.7064 

1.6104 

1.9814 

9.1846 

(c) 

22.8786 

7.5307 

6.7326 

14.5896 

(d) 

14.0215 

1.6619 

1.9951 

11.2910 

(e) 

13.7301 

1.6264 

1.8889 

9.2900 

(f) 

13.7794 

1.6723 

1.8979 

9.2897 

MUS 

12.5787 

1.5531 

1.7919 

9.6220 

ECR-1 

13.6403 

1.6605 

1.9015 

8.8085 

P&K 

25.5940 

15.5229 

15.1522 

27.2411 

Scenario B 


R2 

R3 

R4 

(a) 

1.4066 

1.6251 

1.5984 

1.5884 

(b) 

1.4123 

1.6005 

1.5737 

1.5419 

(c) 

4.8496 

4.3588 

4.8097 

5.2142 

(d) 

1.4096 

1.5961 

1.5729 

1.5403 

(e) 

1.4127 

1.6003 

1.5736 

1.5417 

(f) 

1.4121 

1.5982 

1.6192 

1.5420 

MUS 

1.4070 

1.5877 

1.5728 

1.5437 

ECR-1 

1.4129 

1.5984 

1.5717 

1.5429 

P&K 

18.4657 

18.6185 

18.6796 

19.0404 

Scenario C 

Ml 

R2 

R3 

R4 

(a) 

9.1013 

9.9974 

8.4766 

19.3288 

(b) 

6.9196 

7.8994 

8.7700 

14.1766 

(c) 

11.6894 

10.8810 

8.8252 

22.6435 

(d) 

7.7730 

9.1701 

9.1987 

16.4153 

(e) 

7.6160 

7.1054 

10.2073 

13.2589 

(f) 

7.1992 

7.1643 

9.4728 

15.2713 

MUS 

6.7458 

7.5579 

9.7924 

14.8356 

ECR-1 

6.4891 

6.7234 

8.4472 

9.3649 

P&K 

17.5726 

16.8717 

3.4988 

20.2620 


Table 3: Mean squared error 


1 ^ 
i=i 


H'gs 


■A?2l 


computed for B 


100 macro-replications, of 


the estimates of the mean vector components figs, g = 1, s = 1,2, according to criteria 
(a)-(f) and MUS of pivotal method, Puolamaki and Kaski (P&K) and ECR-1 algorithms. 
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8 Case study 


Fishery dataset, originally taken from Titterington et al. (1985) and used by Papastamoulis 


(2016) for comparing different relabelling procedures, consists of n = 256 snapper length 
measurements. In Figure]^ the histogram of the lengths is shown. In this section, we apply 
our method to these data and test its efficiency, comparing the results with some methods 
from the label.switching package. We use a Gaussian mixture with G = 5 components as 


suggested by Papastamoulis (2016), that is: 


Vi 


G 

E 

9=1 


i = l,...,n 


( 22 ) 


Histogram of x 



length 


Figure 8: Histogram of fishery data. On x-axis the snapper length measurements 


We set up a Gibbs sampling through the bayesmix R package (Griin, 2011), with H = 
11000 iterations and a burn-in period of 1000. 
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Figure 9: Fishery data. {From top left to bottom right) Raw MCMC sample for 
fig,g = Reordered MCMC samples by applying the permutations returned by 

label. switching function, according to methods ECR, ECR-iterative-1, ECR-iterative- 
2, PRA, STEPHENS, AIC, DATA-BASED. Reordered MCMC samples according to the 
pivotal method. 


In Figure the raw MCMC sample (Top left) and the reordered MCMC samples for 
/ig, g = 1 ,..., 5, for different methods, are shown. Despite an ordering constraint for the mean 
components (the priors are chosen according to the independence option, which favours a 
natural ordering of the means), label switching occurs, and the raw sampler is unable to yield 
useful means estimates for the single components. The label. switching function of the 
same package is used to reorder the obtained chains according to the resulting permutations. 
The methods from the lab el. switching package seem to perform similarly. In particular, 
for the greatest mean (light blue trace) there is a global tendency of switching. We note 
that for the DATA-BASED method the same happens also for the second mean (blue trace). 
Our pivotal method seems to work better in isolating the hve high-posterior density regions. 
We recall that the reordering for our method is explained by (l5|. 

Concerning the computational times reported in Table |4 AIC is the fastest method, 
since it only applies an ordering constraint and consequently permutes the simulated MCMC 
output, while STEPHENS —a probabilistic relabelling— is the slowest. Our method is quite 
fast, especially if compared with ECR-iterative-1, ECR-iterative-2 and DATA-BASED. 
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Method 

CPU time (sec.) 

STEPHENS 

344.50 

PRA 

3.96 

ECR 

8.66 

ECR-iterative-1 

60.83 

ECR-iterative-2 

28.39 

AIC 

0.08 

DATA-BASED 

22.68 

Pivotal 

9.57 


Table 4: Fishery data: CPU times in seconds for different methods, with H = 11000, 
bnrn-in=1000, n = 256 and G = 5. 


9 Conclusions 


We propose a simple procednre for dealing with label switching in Bayesian mixtnre models, 
based on the identihcation of as many pivots as mixtnres components, nsed for relabelling 
the resnlting MCMC chains. The main novelty of onr contribntion consists in providing 
some nsefnl indications on how to choose the pivots, since, as mentioned in Section 6.1, the 


idea of solving the relabelling issne by hxing the gronps for some nnits is not new (Chung 


et ah, 2004). We suggest to adopt one of six alternative methods based on a maximization 


or a minimization of some quantities derived from a similarity matrix obtained through the 
MCMC sample, or a further demanding algorithm suitable when the number of groups G is 
relatively small (e.g. G = 4). 

A fundamental issue is represented by the pairwise (perfect) separation between pivots, 
since it is crucial for the proposed procedure and, usually, non-trivial. 

From a computational side, the method appears to be fast and simpler than other rela¬ 
belling methods, since it does not require a maximization/minimization step at each itera¬ 
tion, and only requires a permutation of the labels induced by the pivots membership. A 
simulation study is conducted in order to test the proposed solution on different possible 
scenarios, showing overall good performances. A case study on a real dataset is presented, 
and the results seem to conhrm the advantage of using the proposed methodology. Moreover, 
when also considering the computational time of our algorithm compared to some procedures 
available in the label.switching R package, we conclude that the proposed methodology 
may represent a valid approach to the label switching problem and, in some cases, may be 
preferable to other existing solutions. 
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Appendix 


MUS algorithm 

The algorithm of Maxima Units Search is an alternative method for detecting pivots which 


does not rely upon a maximization/minimization step as the other six procedures in (18) 


and (19), but it searches for G pivots which satisfy a proper feature within the estimated 
similarity matrix C. The underlying idea is to choose as pivots those units in correspondence 
of which the G x G sub-matrix of G with only the row and columns corresponding to ii,..., 
is more often (close to) the identity matrix. Let us denote this sub-matrix of G = {cij) only 
containing the rows and columns corresponding to the pivots zi,...,Zg by It is 

worth stressing that for a small number of groups (e.g., G = 4) and a sample size n ranging 
between 100 and 1000, this research can be computationally demanding. Furthermore, a 
positive number of identity matrices is not always guaranteed. However, the MUS algorithm 
has proved to be efficient in terms of mean square errors for group means estimation, as 
shown in Table The main steps of the algorithm are summarized below. 


(i) For every group g, g = 1, ...,G, hnd the maxima units within matrix G, i.e. 

the units in group g with the greatest number of zeros in correspondence of the units 
of the other G — 1 groups, where M is a precision parameter hxed in advance (in our 
simulation study M = 5). 

(ii) For these M x G units, count the number of distinct identity sub-matrices of rank G 

7)(jxG) which contain them. 

(iii) For each group g, g = 1, ...,G, select the unit which yields the maximum number of 
identity matrices of rank G. Such unit represents the pivot to be used for relabelling 
the chains as explained in Section 


24 




